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Abstract 

This paper studies the partial estimation of Gaussian graphical models from high-dimensional 
empirical observations. We derive a convex formulation for this problem using £i-regularized 
maximum-likelihood estimation, which can be solved via a block coordinate descent algorithm. 
Statistical estimation performance can be established for our method. The proposed approach 
has competitive empirical performance compared to existing methods, as demonstrated by var- 
ious experiments on synthetic and real datasets. 

1 Introduction 

Given n independent copies {Z^^^}f^^ of a random vector Z €z M.^ with an unknown covariance 
matrix S, the problem of precision matrix (inverse covariance matrix) estimation is to estimate 
= In particular, for multivariate normal data, the precision matrix induces the underlying 
Gaussian graphical structure among the variables. For such Gaussian graphical models (GGMs), it 
is usually assumed that a given variable can be predicted by a small number of other variables. This 
assumption implies that the precision matrix is sparse. Therefore estimating Gaussian graphical 
model can be reduced to the problem of estimating a sparse precision matrix. 

One a pproach to spar s e precision matrix estimat i on is covariance selection or neighborhood 
selection ( Dempster . 1972 : Meinshausen Sz Biihlmann . 20061 ). which tries to estimate each row (or 
column) of the precision matrix by predicting the corresponding variable using a sparse linear combi- 
nation of other variables. An alternative formulation is maximum- likelihood estimation method that 
directly estimate the full precision matrix. The sparseness of the precision ma trix can be achieved 
by adding sparse pena l ty functions such as th e £i-penalty or the SCAD penalty (jd'Aspremont et al 
20081 : iFriedman et~all . I2OO8I : iFan et al.l . I2OO9I ) . 



In this paper, we are interested in the problem of estimating blockwise partial precision matrix. 
Given n independent copies {Y^^^; X^'^')}^^^ of a random vector Z = {Y;X) G MP x M.'' with an 
unknown precision matrix 



^yy ^yx 



"yx 



1 



our goal is to simultaneously estimate the blocks ^lyy and flyx, without attempting to estimate 
the block ^xx- If the joint distribution of Z = {Y;X) is normal, then Qyy has an interpretation 
of conditional precision matrix of Y conditioned on X, and Qyx induces the mutual conditional 
dependency between these two groups of variants. In machine learning applications where Y is the 
response and X is the input feature, estimating partial precision matrix can be a useful tool for 
constructing graphical models for the response conditioned on the input. For instance, in multi- 
label image annotation, the response Y is the indicator vector of annotation and the input X is 
the associated image feature vector. In this case, i^yy induces a Gaussian graphical model for the 
tags while Qyx identifies the conditional dependency between tags and features. If we are mainly 
interested in the conditional precision matrix 0,yy and the interaction matrix i^yx, then it is natural 
to ignore i^xx- Consequently, we should not have to impose any assumption on the structure of 

^XX' 

Although the existing algorithms for GGMs can be used to estimate the full precision matrix 
0, and consequently its blocks ^lyy and i^yx, it requires an accurate estimation of ^xx', in order to 
estimate Qxx in high dimension, we have to impose the assumption that fi^x is sparse; and the 
degree of its sparsity affects the estimation accuracy of ^lyy and fiyx- Moreover, when q is much 
larger than p, computational procedures for the full GGMs formulation do not scale well with 



respe ct to ^Ixx- For example, the computational complexity of graphical Lasso (iFriedman et al. 



20081 ). a representative GGMs solver, for estimating O is 0{{p + q)^). This complexity is dominated 



by q when q ^ p and thus can be quite inefficient when q is large. Unfortunately, it is not 
uncommon for the feature dimensionality of modern datasets to be of order 10^ ~ 10^. Taking 
document analysis as an example, the typical size of bag-of-word features is of the order 10^. In 
web data mining, the feature dimensionality of a webpage is typically of the order 10^ ~ 10^. In 
contrast, the dimensionality of the response Y, e.g., the number of document categories, is usually 
of a much smaller order 10^ ~ 10^. The purpose of this paper is to develop a formulation that 
directly estimates the precision matrix blocks i}yy and i}yx without explicit estimation of the block 

^XX' 

To estimate the underlying graphical model of Y, one might consider applying existing GGMs 
to the marginal precision matrix Clyy = ^yy- However, this approach ignores the contribution of 
X in predicting Y, and from a graphical model point of view, the margi nal precision niatrix 17,„, 



may be dense. Taking the expression quantitative trait loci (eQTL) data (jjansen &: Napl . l200ll l as 
an example, if two genes in Y are both regularized by the same genetic variants in X at the gene 
expression level, then there should not be any dependency of these two genes. However, without 
taking the genetic effects of X into consideration, a link between these two genes is expected. 

We introduce in this paper a new sparse partial precision matrix estimation model that si- 
multaneously estimates the conditional precision matrix ^lyy and the block matrix Qyx under the 
assumption that there are many zeros in both matrices. The key idea is to drop the ii regular- 
ization for the i^xx part in the full GGMs formulation; as we will show, this leads to a convex 
formulation that does not depend on ^xx, and consequently, we do not have to estimate ^xx- Nu- 
merically this idea allows us to solve the reformulated problem more efficiently. We propose an 
efficient coordinate descent procedure to find the global minimum. The computational complexity 
is 0{p^ + p'^q + pqmm{n, q}), where n is the sample size. Statistically, we can obtain convergence 
results for Qyx and 0,yy in the high dimensional setting even though we do not impose sparsity 
assumption on ^Ixx- 

Although derived in the context of GGMs, our method is immediately applicable to the problem 
of multivariate regression with unknown noise covariance. This observation establishes the connec- 



2 



tion between our method and the conditional GGM proposed bv lYin &: which estimates 

conditional precision matrix Qyy via multivariate regression. However, the conditional graphical 
model formulation derived there is quite different from the partial graphical model formulation of 
this paper. In fact, the resulting formulations are different: we impose the sparsity assumption on 
^lyx, which leads to a convex formulation, while they impose the sparsity assumption on 0, 
which leads to a non-convex formulation. 

In sum mary, our method has the following merits compared to the standard GGMs and the 
method bv lYin fc Lil (|201lh : 



Convexity: We estimate partial precisio n matrix via sol ving a convex optimization problem. 
In contrast, the formulation proposed by Yin &: lI ( 2011 ) for a similar purpose is non-convex 
and thus the global minimum cannot be guaranteed. 

Scalability: The proposed approach directly estimates the blocks VLyy and ^yx by optimizing 
out the block of ^xx- This leads to improved scalability with respect to the dimensionality of 
X in comparison to the standard GGMs formulation that estimates the full precision matrix. 

Interpretability: For normal data, the sparsity constraint on Vtyx in our formulation has a 
natural interpretation in terms of the co nditional deperi dency between the variables in X and 
Y . This differs from the assumption in ( Yin &: Li . 201 ll ) that essentially assumes the sparsity 



of QyyQyx which does not have natural graphical model interpretation. 

Theoretical Guarantees: Theoretical performance of our estimator can be established 
without the sparsity assumption on il.xx- 



1.1 Related Work 

Numerous methods have been proposed for sparse precision matrix estimation in recent years. 
For GGMs estimation, a popular form ulation is maximum likelihood estimation with ^i -penalty 



on th e entries of the precision matrix (lYuan Sz Linl . 120071 : iBanerjee et al.l . 
20081 ). The £i -penalty leads to sparsity, and the resulta nt problem is convex. 



20081 : iRothman eTal 



antees of this type of methods have been investigated by iRavikumar et al.l (120111): iRothman et al 



T heoretical guar- 



( 20081). and its computation has been extensively studied in the literature ( d'Aspremont et al 



20081 : iFriedman et al.l . l2008l : O l2009l l. Non-convex formulations have also been considered be- 
cause it is known that ii^^en^iy suffers from a so-call e d bia s problem that can be remedied using 



non-convex penalties ( Fan et al. . 20091 : Johnson et al. . 2012 ). As an alternative approach to the 



maximum likelihood formulation, one may directly estimate the support (that is, nonzero entries) 
of the sparse precision matri x using separate neighborhood estimatiq r is for each variable fo llowed 

by a proper aggregation rule (iMeinshausen &: Biihlmannl . l200f)l : lYuanl . l20inl : ICai et al.l . l201lh . 

The conditional precision matrix i}yy is related to the latent Gaussian Graphical model of (jChandrasekaran et al 



2010l ). where Y is observed and X are unobserved hidden variables. If we further assume that X 
is low-dimensional (which is different from the situation of observed high dimensional X in this 
paper), then the we may write the marginal precision matrix Qyy using the Schur complement as 

This exhibits a sparse low-ran k structure because i}yy is sparse and the 



yy 



a 



yy 



Q O-IOT 
iLyxi^xx ^^yx- 



dimensionality of X is low. IChandrasekaran et al.l (120101 ) explored such a sparse low-rank structure 
and proposed a convex minimization method to recover i}yy as well as the low-rank component. 
Although the model is more accurate than standard GGMs, the formulation does not take advan- 
tage of the additional information provided by X when it is observed. Another issue is that this 
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latent Gaussian graphical model assumes that the hidden variable X is of low dimension, which 
may not be realistic for many applications. 



Our approach is also closely related to the conditional Gaussian graphical model (cGGM) (jYin &: Li . 



201 ll ) studied in the context of eQTL data analysis. The cGGM assumes a sparse multivariate re- 
gression model between Y and X with (unknown) sparse error precision matrix. However, the 
log-likelihood objective function associated with the model is non-convex. Their theoretical analy- 
sis applies for a local minimum solution w hich may iiot be the solution found by the algorithm. The 
cGGM model has also been considered in (|Cai et al.l . boid ^. The authors proposed to first estimate 



the linear regression parameters by multivar iate Dantz i g-sele ctor and then estimate the conditional 



precision matrix by the CLIME estimator (jCai et al.l . 120111 ). The rate of convergence for such a 



two-stage estimator was analyzed. Different from cGGM, our partial precision matrix estimation 
approach directly estimates the blocks of the full precision matrix via a convex formulation. This 
significantly simplifies the computational procedure and statistical analysis. Particularly, when 
Y is univariate, our m odel reduces to the £i-penalized maximum likelihood estimation studied 
bv lStadler et al.l(|20inl ) for sparse linear regression. For multivar i ate ra ndom vector Y , our method 



can be regarded aTImultivariate generahzation of lstadler et al.l » for sparse hnear regression 
with unknown noise covariance. 

1.2 Notation 

In the following, is a positive semi-definite matrix: ^ 0; x G is a vector; A G W^^ is a 
matrix. The following notations will be used in the text. 

Amin(f^): the smallest eigenvalue of il. 

Amax(f^): the largest eigenvalue of $7. 

Vt~: the off-diagonals of il. 

Xi'. the iih entry of a vector. 



||x||2 = V x: the Euclidean norm of vector x 
\\x\\i = Ylt=i the ^i-norm of vector x 
\\x\\q: the number of nonzero of x. 

Aij-. the element on the ith. row and jth. column of matrix A. 

Ai.: the ith. row of A. 

A.j-. the jth. column of A. 

\A\oQ = maxi<j<p^i<j<q \Aij\: ^oo-norm of A. 

\A\i = X^iLx Si=i element-wise £i-norm of matrix A. 

\\A\\i = maxi<j<q the matrix £i-norm of A. 



^\\f = \/Ya=i Z]?=i ^ij- the Frobenius norm of matrix A. 
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• ||j4||2 = sup|j^||2<i ||ylx||2: the spectral norm of matrix A. 

• supp(74) = : Aij 7^ 0}: the support (set of nonzero elements) of A. 

• /: the identity matrix. 

• S: the complement of an index set S. 

1.3 Outline 

The remaining of this paper is organized as follows: Section [2] introduces the partial Gaussian 
graphical model (pGGM) formulation; its statistical property in the high dimensional setting is 
analyzed in Section [3l Section S] presents a coordinate descent algorithm which can be used to 
solve pGGM. The extension of the proposed method to multivariate regression with unknown 
covariance is discussed in Section O Monte-Carlo simulations and experimental results on real data 
are given in Section [6j Finally, we conclude this paper in Section [71 



2 Sparse Partial Precision Matrix Estimation 
2.1 Gaussian Graphical Model 

Suppose that two random vectors Y £ W and X £ are jointly normally distributed with 
zero-mean, i.e., Z = (Y;X) ~ AA(0, S*). Its density is parameterized by the precision matrix 
n* = ^ as follows: 

6(z; n*) = ^ exp | --z~^Vt*z 

^ ' V(27r)P+'?(detf)*)-i I 2 

It is well known that the conditional independence between Zi and Zj given the remaining variables 
is equivalent to Q.*^ = 0. Let G = (V, E) be a graph representing conditional independence relations 
between components of Z. The vertex set V has p + q elements corresponding to Zi = li, Zp = 
Yp,Zp^i = Xi, Zp_^q = Xq, and the edge set E consists of ordered pairs where £ E 

if there is an edge between Zi and Zj. The edge between Zi and Zj is excluded from E if and 
only if Zi and Zj are independent given {Zf^, k ^ Thus for normal distributions, learning the 

structure of graph is equivalent to estimating the support of the precision matrix ri*. 

Suppose we have n independent observations {Z^*) = X^^^)}f^-^ from the normal distribu- 



tion 7V(0, S*). Let 



yy yx 

yinT yn 
yx ^xx 



be the empirical covariance matrix in which 



i=l 1=1 i=l 



The negative of the logarithm of the likelihood function corresponding to the GGMs is written by 

L{n) := -logdetJ^ + 

It is well-known that L{^}) is convex when ^7 ;^ 0, which implies that it is jointly convex with respect 
to the blocks ^yy, ^yx and The goal of GGMs learning can be reduced to the problem of 
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estimating the precision matrix Q* with extra sparsity constraints. In particular, the fohowing £i- 
regulari zed maximum-l i kehho od method is the most popular formulation to learn sparse precision 
matrix ( Baneriee et al. . 20081 ): 



n = aTgmm{L{n) + Xn\n-\i}, (2.1) 
where is the strength parameter of the penalty. 
2.2 Partial Gaussian Graphical Model 

We now present a new maximum-likelihood formulation for the partial GGM (pGGM) that only 
aims at estimating the blocks il.*y and 0*^ instead of estimating the full precision matrix 0*. 
Without causing confusion, we can write L{Q) as L{Qyy,i},yx,^xx)- The basic idea of pGGM is 
to eliminate Qxx by optimizing L{Qyy,Qyx,i^xx) with respect to Qxx, and this can be achieved 
if we do not impose any sparsity constraint on Qxx- As we will show in the following, this idea 
allows us to decouple the estimation of Qxx from the estimation of {^yy, ^yx}- This not only allows 
faster computation, but also allows us to develop a theoretical convergence analysis for {Qyy,Qyx} 
without assuming the sparsity of i^xx^ 

We introduce a reparameterization Qxx '■= ^xx — ^yx^yy^yx- Note that ;^ implies Qxx >~ 0. 
The following proposition indicates that with such a reparameterization, L can be decomposed as 
the sum of a component only dependent on {^yy, ^yx} and a component only dependent on ^xx- 

Proposition 1. Under the transformation 0,xx = ^xx — ^yx^yy^yx we have 

^(.^yyi^yx^^xx) — -^(^yy; ^yx? ^xa?) — Lpa(S^yyi^yx) ~^ -^(S^xx^^ (^•^) 

where H{VLxx) = — log det + triiy^^VLxx) and 

Lpai^yy, ^yx) ■= " log det(J7^,) + tr (S^^^J^j,,) + 2tr(S«JJ7^,) + tT{J:^,n]^^nyy'nyx). (2.3) 
Moreover Lpa{i^yy,0,yx) is convex. 

The proof of Proposition [1] is provided in Appendix lA.li Since both Lpa(r2j;y, ^yx) and H{^lxx) 
are convex, we have that L(^^yy^^yx^^xx 

) is jointly convex in {i^yy, Qyx,^xx}- 
The decomposition formulation (12.20 is the key idea key behind our new formulation which 
decouples the optimization of {^yy, ^yx} and ^xx- In the high dimensional setting, we consider the 
following penalized problem using the reparameterized 

{^yy^^yxi^xx^ — arg miu ^L(^^yy^^yx^^xx^~^-^(S^yy^^yx^~^-P(S^xx^^^ (^•^) 

where RiVtyy, flyx) and P{il.xx) are decoupled regularization terms that can guarantee the problem 
to be well-defined. Based on (j2.2p . problem (|2.4p can be decomposed into the following two separate 
problems: 

{Q,yy,Qyx} = Hlg 111111 {Lp^[Qyy , Qyx) R{^yy , i^lyx)} 1 l^'^) 

Q.XX = sxgmhi{H{Vtxx) + P{Vlxx)}- 

^xx>-0 
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We call the first equation specified in (|2.5p as partial Gaussian Graphical Model or pGGM, which 
is the main formulation proposed in this paper. If we assume that both Q*y and O*^, are sparse, 
then we may use sparsity-inducing penalty R(Qyy,^lyx) in ()2.5p . For example, the following two 
penalties enforce element-wise and column-wise sparsity respectively: 

• Element-wise sparsity-inducing penalty: Re{^yy,0,yx) = A„|J7~j^|i + Pn\^yx\i- 

• Column-wise sparsity-inducing penalty: Rc{^yy,^yx) = ^n\^yy\i+Pn\\^yx\\2,i where ||riya;||2,i = 
J2j=i \\i^yx)-j\\- 

If we use the element-wise sparsity-inducing penalty, then the resulting formula is similar to ii- 
penalized full Gaussian graphical model formulation of (j2.1|) . The main difference is that the 
pGGM formulation (j2.5p does not depend on 0,xx, and consequently it does not require the sparsity 
assumption on O^x- One advantage of pGGM is the significantly reduced computational complexity 
when X is high dimensional. Another important merit of pGGM is that it does not depend on 
model assumptions of ^l*^ because the o ptimization has been decoupled. This is analogous to 



the situation of conditional random field ( Laffertv et al.l . |2001| ) where we model the conditional 



distribution of Y given X directly, and good model of the distribution of X is unnecessary or 
ancillary for discriminative analysis. In particular, as we will demonstrate in Section 16. H the 
formulation performs well even if Q^x is relatively dense compared to Q,yy and ^2*^. 



3 Theoretical Analysis 

We now analyze the estimation error between the estimated precision matrix blocks {Qyy,Qyx} 
in (12. Sp and the true blocks {Qyy,^}*^}- Let Syy := supp(il*j^) U {{i,i) : i = and Syy be 

its complement. Similarly we define Syx and Syx- To simplify notation, we denote Q = {flyyjQyx), 
S — ^yy ^yx S — ^yy ^yx • The error of the first-order Taylor expansion of Lpa at in 
direction A0 is 

ALpa(e, AG) := Lpa(e + AG) - Lpa(G) - (VLpa(G), AG). 
We introduce the concept of local restricted strong convexity to bound (5Lpa(G, AG). 

Definition 1 (Local Restricted Strong Convexity). We define the following quantity which 
we refer to as local restricted strong convexity (LRSC) constant at Q: 

/3(G;r,a) = inf |^^^|^^|^ : < ||AG||f < r, [AG^Ii < a|AGs|i} , 

where a = 3max{A„,/9„}/ min{A„,/9„}. 

As will be described in our main result, the Theorem [H that the LRSC condition of Lpa is 
required to guarantee the statistical efficiency of pGGM. Before presenting the theorem, we will 
first show that when n is sufficiently large, such a condition holds with high probability under 
proper conditions. We require the following assumption. 
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Assumption 1. Assume that the following conditions hold for some integers s: 



inf <! r_H£!! : n / 0, H-uHo < s } >0.5, 



sup 



: n 7^ 0, ||ti||o < s f <l-5, 



<1.4. 



The assumption is similar to the RIP condition i n compressed sensing. The following result 
is known from the compressed sensing literature (see Baraniuk et al. . 20081 : Rauhut et al. . 20081 : 
Candes et al.l . l201ll . for example) . 



Proposition 2. There exists absolute constants ci and C2 such that Assumption [I] holds with 
probability no less than 1 — exp(— C2n) when n > ci{p + slog{p + q)). 

Assumption [1] can be used to obtain a bound on f3{Q* ,r, a). 

Proposition 3. Let 

p_ = 0.5min(Amax(f^y2/)"\ Amin(S*.^.))) P+ = l-5Amax(S*.^.)- 

Assume that AssumptionU\ holds with s = \S\ + [4(p+/p_)a^|S'|] . // 



r < min 



o.5A^in(n;^), 0.13 JA^ax [n;,j:i,{n;,y] /p, 



then we have 



/3(G*,r,a) > 



mm 



Amin(3r2 



yy^ 



The following definition of jn is also needed in our analysis. 
Definition 2. Define 

A —V" _ V* _ (o* \-^o* (yi^ — y* \C)*^ (O* 

n —^yy ^yy \^''yy) ^''yxx'^xx ^ XX ) ^ ''yx \^ ''yy ) 
Bn =2(S^^. — '^yx + {^yy) ^yx^xx ~ ^xx))' 

7„ =max{|^„|oo, |-B„|oo}. 

We have the following estimate of 7^. 

Proposition 4. For any rj G (0,1), and given the sample size n > log(10(p + g)^/r?), we have with 
probability 1 — r]: 



7n < 16A/log(10(p + qY/ri)/n 



max(s:j + Ta^m;y)-^ai,Kx^*jx{^iyr^ 



yx 



"yx \""yyJ 



The following result bounds the Frobenius norm estimation error in terms of 7^. 
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Theorem 1. Let Q = (Qyyjflyx) be the global minimizer of (j2.5p with element-wise £i-penalty 
Re- Assume that Xn,Pn G i'^lmCo'ln] for some cq > 2. We further assume that Lpa has LRSC 
at Q* = (f2*j^,Q*3,) with constant /3(0*;r, a) > 0. Consider rQ,j3Q > so that /3(0*;ro,a) > /3o- 
Define A„, = 1.5co/3o'SnVf^- < ro, then 

\\e-e*\\F < i.5co/3o-Snvl5T. 

The following corollary is easier to interpret than Theorem [TJ 

Corollary 1. Let Q = {i),yy,Q,yx) be the global minimizer of ()2.5p with element-wise ii-penalty Re- 
Assume that \n,Pn G [27n5Co7n] for some cq > 2. Define 



/3o 



40Amax(^^. 



mm 



yy 



3Aniin (S^y 



yy 



8Amax(^^ya;S*2.(ri*a,)^) 



ro =min 



0.5A„,in(J^;,), 0.13 J A^ax /P4 



70 =16 



max(S*J + max(((f)*j^) (0* 



Let ci and C2 be absolute constants in Proposition If n is sufficiently large so that 

n > max [ci{p + s log(p + q)), log(10(p + qf/r]), (1. 50070)^(^0/30)^ log(10(p + q)^/7j)] 
with s = \S\ + [4(p_|_/p_)a^|S'|] , then with probability no less than 1 — exp(— C2n) — rj, 



lie - e*||p < 1.5co/3o"So\/l'5| log(10(p + q)yr])/n. 

Proof. Since n > ci{p + slog(p + g)), with probability no less than 1 — exp(— C2ra) — rj, both 
Assumption [1] hold and Proposition [5] are valid. 

Since Assumption [1] holds, Proposition [3] implies (3{@* ,ro, a) > Pq. Since n > log(10(p + g)^/??), 
Proposition m implies that 7„ < Y^log(10(p + qY M) /u^q. Therefore the assumption of n implies 

that A„ < 1.5co/3o'So\/l'S'l log(10(p + g)2/??)/n < ^^o, and Theorem [D implies that ||G - G*||f < 
A„. □ 

We may assume that /3o, r^, and 70 to be 0(1) constants that depend on 0* and S*. The 
corollary implies that when n is at least the order of p + l^l log((p + q)/r]), then 



|e - Q*\\f = 0{^/\S\ logiip + q)/r^)/n) 



4 Numerical Algorithm 



We present a coordinate descent procedure to solve the pGGM problem (j2.5p . The algorithm 
alternates between solving the following two subproblems on Qyy and i^yx respectively: 



''''yy 

^''yx 



arg mm 



arg mm 



Lpa,{^yy,^yx) + R{^yy,^yl) , 
Lipai^y^ \^yx) ~^ R{^yij~ \ ^yx) 
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(4.1) 
(4.2) 



Since the objective is convex, it is guaranteed that the above procedure converges to the global 
minimum. Let us first consider the minimization problem (j4.ip . This is equivalent to 



arg mm 



(4.3) 



where 



F^'\nyy) := -logdet(0,,) +tr(S^^f^,,) +tr(E gJOg)^0 ;^^»g). 

In ou r implementation, the proximal gradient descent method (|Nesterovl . 1200.4 iBeck fc Teboullel . 
20091 ) is utilized to solve the above composite optimization problem, where the gradient of the first 
(smooth) term of (j4.3p is given by 

\/ \^}yy) = —^yy + ^yy ~ ^j/j/ ^xx (^yx ) ^yy • 

Next, we consider the minimization problem (|4.2|) . This is equivalent to 



^''yx 



arg mm 



(4.4) 



where 



G^'\ny^) :=tr(Sj 



y'^x) + 2tr(S 



yx '•''yx) 



^yxj ■— ^''\^xx^^yx\^^yy J ^^yxj 

Again, we apply the proximal gradient method to solve this subproblem. Here the gradient of the 
first (smooth) term of (j4.4p is given by 



VGW(f),x) = 2(J](* 



yy 



-1q yn inyn 
^'-yx^xx ' ^^yx- 



The computational complexity in terms of p and q for this coordinate descent algorithm is as 
follows: (1) 0{p^ + p'^q + pqm.m{n, q}) for the subproblem (14. ip due to the inverse of i^yy and the 
matrix product in the evaluation of gradient VF^^\^yy); and (2) 0{p^q + pqmm{n,q}) for the 
subproblem (14. 2p from matrix product in evaluating gradient V G^^\il.yx) ■ Therefore, the overall 
complexity of the proposed algorithm is 0(j)^ + p'^q + pgmin{n, g}). This can be compared to the 
0{{p + q)'^) or higher per iteratio n complexity required by well known representative algorithms for 
full precision m atrix estimation ( Friedman et al. . 20081 : d'Aspremont et al. . 2008 : Rothman et al. 



^aO). In the high dimensional setups where , » max{n,ri, the computational advantage 
of pGGM over standard GGMs can be significant. 



5 pGGM for Multivariate Regression with Unknown Covariance 

In this section, we show that pGGM provides a convex formulation for solving the following model 
of multivariate regression with unknown noise covariance: 

Y = r;,X + ey, (5.1) 

where Y £ W, X G T*^ is a p x q regression coefficient matrix and the random noise vector 
Ey ~ M{0, {Q.yy)~^) is independent of X. Our interest is in the simultaneous estimation of F*^. and 
Clyy from observations {5^^*^; in the high-dimensional setting. Note that for this regression 

problem we do not have to assume the joint normality of {Y;X), but rather that the noise term is 
normal (or more generally sub-Gaussian). Our discussion in this section is based on the fact that 
pGGM is a regularized maximum likelihood estimator for multivariate regression with Gaussian 
noise. 
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5.1 pGGM as a Conditional Maximum Likelihood Estimator 



We will start our discussion under the joint Gaussian setup, which provides the connection of the 
pGGM formulation and multivariate regression. Let the true covariance matrix S* be partitioned 
into blocks 



yy yx 

yx XX 



Here we assume that {Y; X) is jointly normal, the conditional distribution of Y given X, given as 
follows, remains normal: 



(5.2) 



Now by using algebra for block matrix inversion, we may write the precision matrix 0* 
as 



and thus 



\ yy ~ ^yxK^xx) ^ 



yx ) 



-1 



\^yy ~ ^yxX^xx) ^yx ) ^yxK^xx) 



K'^xx) yx \ yy ^yx\'^xx) yx ) 



□ 



^yx x^x'i 



-1 yi*T 

yx 



-1 



yy \ yy yx\^xxj yx J ' ""yx ""yy^yx 

Therefore the conditional distribution (j5.2|) can be rewritten as: 



^''yy-^yxK-^xx) 



(5.3) 



Y I X 



M{-{n*j-'n;^x,{n*j- 

This can be equivalently expressed as the following multivariate regression model: 

y = — (rj* ) ^n*x + ey, 



(5.4) 



where Ey ~ 



AA(0, {Q,*y)~^) is independent of X. Note that this model can be regarded as a reparam- 
eterization of the standard multivariate regression model in ()5.ip . It is easy to verify that given the 
observations {5^^*-*; the negative of the conditional log-likelihood function for Sy is written 

by 

- log detinly) + tvij:;yn*yy) + {^^j + t r ( ^ ^ ^ ^ ( ) " ^ ^ ^ . 

which is exactly Lps,{Q,*y,i^*^) given by (j2.3p . Therefore, pGGM is essentially a regularized condi- 
tional maximum likelihood estimator for the regression model ()5.4p . This implies that we can use 
pGGM to solve multivariate regression problem with unknown noise covariance matrix i^yy 



5.2 Convexity and cGGM 

We now consider the general multivariate regression model (|5.ip with Gaussian noise. A more 
straig htforward method for estimating the model parameters {^yy, ^yx} was considered bv lYin &: Li 
(l201l|) using the following £i-regularized log- likelihood function associated with e^: 



{Clyy,tyx} — aigmin <—logdetflyy+tr(TiYy^^yy) + Xn\{^yy) |i + Pn I ^yx 1 1 

(5.5) 
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where Sp^^ = T,yy — T,y^Ty^ — Tyx'SyJ. + TyxT,'^xTyx. However, with this formulation, the objective 
function in (j5.5p is not jointly convex in Ty^ and ^yy, although it is convex with respect to Tyx for 
any fixed i^yy, and it is also convex respective to Clyy for any fixed Ty^. 

In contrast, the expression (j5.4p is jointly convex in {Qyy,Qyx}, which may be regarded as a 
convex reparameterization of (j5.ip under the following transformation: 

^yy ^yy ' ^y^ ^yy ^yx- 

This transformation yields a one-to-one mapping from {i),yy,Tyx} to {Qyy,Qyx}. The convexity of 
(j5.4p is desirable both for optimization and for theoretical analysis which we considered in Section [3j 
It is worth mentioning that for high dimensional problems, regularization has to be imposed 
on the model parameters. With regularization, the pGGM regression formulation (j5.4p becomes 
(j2.5p . which is different from the cGGM formulation of (I5.5p . This is because for pGGM, the 
£i-norm penalties are imposed on {Qyy,il.yx}, and for cGGM, the £i-norm penalties have to be 
directly imposed on {^yy, ^yx}- The former has a natural interpretation in terms of the conditional 
dependency between the variables in X and Y, while the latter does not have such an intuitive 
interpretation. 



5.3 Univariate Case 

As a special case, when the output Y is univariate, pGGM reduces to a regularized maximum 
likelihood estimator for high-dimensional linear regression with unknown variance. In this case, 
by replacing the scalar ^lyy and the row vector D.yx with to and respectively in the pGGM 
formulation (j2.5p . with element-wise ^i-penalty Re, we arrive at the following estimator: 

{u;,9} = argminLpa(t^,6') +p||6'||i, (5.6) 

uj>o,e 

where 

Lpa(w, 9) := - log{u) + S> + 29^^"xy + 9'^^^J/oj. 

As aforementioned that this is identical to a regularized maximum likelihood estimator for the 
following linear regression model with unknown variance: 



Y 



-UJ 



(5.7) 



where e ~ A/'(0,w ^) is independe nt of X. The specifi c ^i-penalized maximum likelihood estima- 
tor (j5.6p has also been studied by IStadler et all (j20ld ') for sparse linear regression with unknown 
noise covariance. For multivari ate random vector Y , pGGM can be regarded as a multivariate 
generalization of the method in (IStadler et al.l . l20ld ). 



For graphical model estimation, pGGM with univariate Y can als o be regarded as a variant of 
the neighborhood selection method ( Meinshausen &: Biihlmann . 20061 ). Let us write the entry 
of Q, at the jth row and the jth column, and denote by ^j-j or ^-j^j the jth row of O with its jth 
entry removed or the 7th column with its 7th entry rem oved respectively. In order to recover the 
non-zero entries in 0, iMeinshausen &: Biihlmann (|2006l ) proposed to solve for each row j a Lasso 
problem: 

(5.8) 



argmin^ ' + 29 ' S!!,- + p\\9\\i. 



-3,3 
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If we fix cl! = 1 in (j5.6p . then the resultant estimator is identical to (jS.Sp . For precision matrix 
estimation, our formulation (|5.6p is different from neighborhood selection (jS.Sp due to the inclu- 
sion of oj as an unknown parameter. More precisely, the quantity is the noise variance for 
the corresponding Lasso regression, and the estimator (|5.6p may be regarded as an extension of 
neighborhood selection without knowing the noise variance. For multivariate random vector Y , 
pGGM can be regarded as a blockwise generalization of neighborhood selection for graphical model 
estimation. 

Fo r precision matrix estimation, the regression model ()5.7p has also been considere d by lYuanl 

However, the author suggested a procedure to estimate 6 via the Dantzig-selector (jCandes &: Taol . 
20071 ) followed by a mean squared error estimator for the variance cj^^. In contrast, the pGGM 



based estimator (15. 6p simultaneously estimates the two parameters under a joint convex optimiza- 
tion framework. 



6 Experiments 

In this section, we investigate the empirical performance of the pGGM estimator on both synthetic 
and real datasets and compare its performance to several representative approaches for sparse 
precision matrix estimation. 

6.1 Monte Carlo Simulations 

In the Monte Carlo simulation study, we investigate parameter estimation and support recovery 
accuracy as well as algorithm efficiency using synthetic data for which we know the ground truth. 

6.1.1 Data 

Our simulation study employs a precision matrix il* whose sub-matrices and Vty^ are sparse, 
while 0*^ is dense. The matrix is generated as follows: we ffist define Vt* = M + al, where each 
off-diagonal entry in M is generated independently and equals 1 with probability P = 0.1 or with 
probability 1 — P = 0.9. M has zeros on the diagonal, and a is chosen so that the condition number 
of ri* is p -h q. We then add the q x q all-one matrix to the block O*^, and the resultant matrix is 
defined as Jl*. We generate a training sample of size n from J\f{0, S*) and an independent sample of 
size n from the same distribution for validating the tuning parameters. The goal is to estimate the 
sparse blocks {Q*y,Q*^}. We fix (n,p) = (100,50) and compare the performance under increasing 
values of q = 50, 100, 200, 500, replicated 50 times each. 

6.1.2 Comparing Methods and Evaluation Metrics 

We compare the performance of pGGM to the following three representative approaches for sparse 
precision matrix estimation: 



cGGM for conditional Gaussian graphical model estimation (jYin fc Lil . l201ll ) . After recover 



ing the regression parameters Tyx and the conditional precision matrix i^yy, we estimate the 



GLasso for £i -penalized precision matrix estimation ( Friedman et al. . 20081 ). We convention 



ally apply GLasso to estimate the full precision matrix Ct. 
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NSLasso for support recovery ( Meinshausen Sz Biihlmann . 20061 ). We use a modified version 
to recover the supports in the blocks Q*y and r^*^ by regressing each Yi on Y^i and X via the 
Lasso. Such a modified neighborhood selection method has also been adopted by lYin &: Li 



(|201ll ) for their empirical study. Note that this method does not provide an estimate of the 
precision matrix. 



For all methods, we use the validation set to estimate the values of the regularization parameters. 

We measure the parameter estimation quality of = {(^yy, ^yx) by its Frobenius norm distance 
to 0* = {Cl*y,Cl*^). To evaluate the support recovery performance, we use the F-score from the 
information retrieval literature. Note that precision, recall, and F-scores are standard concepts in 
information retrieval defined as follows: 



Precision 
Recall 
F-score 



TP/(TP + FP) 
TP/(TP + FN) 

2- Precision- Recall 



Precision+Recall ' 

where TP stands for true positives (for nonzero entries), and FP and FN stand for false positives 
and false negatives. Since one can generally trade-off precision and recall by increasing one and 
decreasing the other, a common practice is to use the F-score as a single metric to evaluate different 
methods. The larger the F-score, the better the support recovery performance. 



6.1.3 Results 



Figure 1(a), 1(b), 1(c) plot the mean and standard errors of the above metrics as a function of 



dimensionality q. The results show the following: 



Parameter estimation accuracy (see Figure 1(a)): pGGM and cGGM perform favorably to 



GLasso. This is expected because GLasso enforces the sparsity of the full precision matrix 
and thus tends to select a smaller regularization parameter due to the dense structure of block 
n*^^. In contrast, pGGM and cGGM exclude n XX in the model and thus avoid potential under 
penalization of sparsity. pGGM and cGGM perform comparably on parameter estimation 
accuracy. Note that NSLasso does not estimate the precision matrix. 



Support recovery (see Figure 1(b)): pGGM achieves the best performance among all four 



methods being compared. pGGM outperforms cGGM since the former directly enforces the 
sparsity on blocks Qyy and ^yx while the latter enforces the sparsity of Fyx = —0,yyO,yx which 
is not necessarily sparse. GLasso is inferior due to the under penalization. We also observe 
that pGGM is slightly better than NSLasso. 



• Computational efficiency (see Figure 1(c)): The pGGM and cGGM methods can achieve 
X 100 speedup over GLasso when q = 500. 

We further compare pGGM to GLasso applied to the marginal distribution of Y by ignoring 



X. We call this method as GLasso-M. The results are plotted in Figure 1(d), 1(e), 1(f) It can be 
observed from these figures that pGGM consistently outperforms GLasso-M in terms of parameter 
estimation and support recovery accuracies. 

The detailed performance figures that are used to generate Figure 16.11 are presented in Ap- 
pendix[B]in tabular forms, along with additional performance metrics in spectral norm and matrix 
£i-norm. The observations using the other norms are consistent with that of the Frobenius norm. 
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Figure 6.1: Performance curves on the synthetic data. Top row: comparison of the estimated 
blocks {^}yy,Qyx}. Bottom row: comparison of the estimated ^}yy by pGGM and GLasso-M. The 
down-arrow J, means the smaher the better while the up-arrow '[ means the larger the better. 



6.2 Real Data 

We further study the performance of pGGM on real data. 
6.2.1 Data 

We use three multi-label datasets CorelSk, MIRFIickr25k and RCVl-v2 and a stock price dataset 
S&P500 for this study. For each dataset, we generate a training sample for parameter estimation 
and an independent test sample for evaluation. Table 16.11 summarizes some statistics of the data. 
We next describe the derails of these datasets. 



CorelSk. This dataset was first used in (jPuvgulu et al.l . |2002| ). Since then, it has become a stan- 



dard benchmark for keyword based image retrieval and image annotation. It contains around 
5,000 images manually annotated with 1 to 5 keywords. The vocabulary contains 260 visual 
words. The average number of keywords per sample is 3.4 and the maximum number of key- 
words per sample is 5. The data set along with the extracted visual features are publicly available 
at lear.inrialpes.fr/people/guillaumin/data.php, In our experiment, we down sample the 
training data to size 450 for constructing the Gaussian graphical models of image keywords. For 
evaluation purpose, an independent test set of size 450 is selected. Each image is described by 
the GIST feature which has dimensionality 512. Our goal is to construct a graphical model for 
image tags. Note that the size of label-feature joint variable is 260 -1- 512 = 772, which allows us to 
examine the performance when p + q > n. 
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Table 6.1: Statistics of data. 





P 


q 


training size (n) 


test size 


CorelSk 


260 


512 


450 


450 


MIRFIickr25k 


457 


512 


1,250 


1,250 


RCV1-V2 


103 


1,000 


1,000 


1,000 


S&P500 


165 


300 


101 


101 



MIRFIickr25k. This data contains 25,000 images collected from Flickr over a period of 15 months. 
The database is available at pres s . liacs .nl/mirf iickr/[ The collection contains highest scored 
images according to Flickr's "interestingness" score. These images were annotated for 24 con- 
cepts, including object categories but also more general scene elements such as sky, water or 
indoor. For 14 of the 24 concepts, a second and more strict annotation was made. The vocabu- 
lary contains 457 tags. The average number of words per sample is 2.7 and the maximum words 
per sample is 32. The data set along with the extracted visual features are publicly available 
at lear.inrialpes.fr/people/guillaumin/data.php, In our experiment, we down sample the 
training set to size 1,250 for constructing the Gaussian graphical models of image keywords. For 
evaluation purpose, an independent test set of size 1,250 is selected. Each image is described by 
the GIST feature of dimension 512. Our goal is to construct a graphical model for image t ags. 
RCVl-v2. This data set contains newswire stories from Reuters Ltd Lewis et al. sev- 



eral schemes were utilized to process the documents including removing stopping words, stem- 
ming, and transforming each document into a numerical vector. There are three sets of cat- 
egories: Topics, Industries and Regions. In this paper, we consider the Topics category set, 
and make use of a subset collection (sample size 3,000, feature dimension 47,236) of this data 
from www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets, We further down sample the data 
set to a size of 1,000, and select the top 1,000 words with highest TF-IDF frequencies. For eval- 
uation purpose, an independent test set of size 1,000 is selected. The vocabulary contains 103 
keywords. The average number of words per sample is 3.3 and the maximum words per sample is 
12. Our goal is to construct a graphical models for these keywords. 

S5dP500. We investigate the historical prices of S&P500 stocks over 5 years, from January 1, 2007 
to January 1, 2012. By taking out the stocks with less than 5 years of history, we end up with 
465 stocks, each having daily closing prices over 1,260 trading days. The prices are first adjusted 
for dividends and splits and the used to calculate daily log returns. Each day's return can be 
represented as a point in M^^^. For each day's return, we chose the first 300 as X and the rest 165 
as Y. We down sample the data set to size 101. For evaluation purpose, an independent test set 
of size 101 is selected. Our goal is to construct the conditional precision matrix of Y conditioned 
on X. 

6.2.2 Methods and Evaluation Metrics 

In these experiments, we compare pGGM to GLasso, GLasso-M (for estimating marginal precision 
matrix using the data component Y only) and NSLasso. Here we focus on convex formulations, 
and thus skip cGGM. For all these methods, we use the Bayesian information criterion (BIG) to 
select the regularization parameters. 

Since there is no ground truth precision matrix, we measure the quality of G by evaluating 
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Table 6.2: Quantitative results on real data 



Lpa value on test set CPU Time (sec.)on training set 





pGGM 


GLasso 


GLasso-M NSLasso 


pGGM 


GLasso 


GLasso-M 


NSLasso 


CorelSk 


-1.08e3 


-0.63e3 




16.63 


125.74 


9.07 


9.06 


MIRFIickr25k 


-1.99e3 


-1.99e3 




56.93 


228.71 


39.74 


42.89 


RCV1-V2 


-0.42e3 


-0.39e3 




3.04 


421.86 


1.38 


75.43 


S&P500 


0.22e3 


0.24e3 




4.83 


46.65 


4.28 


4.29 



the Lpa objective (recall its definition in (j2.3p ) on the test data. The training CPU times are also 
reported. Since the category information of RCVl-v2 and S&iP500 are available, we also measure 
the precision of the top k links in the constructed conditional GGM from ^lyy on these two datasets. 
A link is regarded as true if and only if it connects two nodes belonging to the same category. Note 
that the category information is not used in any of the graphical model learning procedures. 

6.2.3 Results 

Table 16.21 tabulates the evaluated Lpa objectives on the test set and the training time. The key 
observations are 

• In most cases, pGGM outputs smaller Lpa objective value than GLasso (note that the Lpa 
value cannot be evaluated for GLasso-M and NSLasso) . pGGM runs much faster than GLasso 
on all these datasets. 

• pGGM is slightly slower than NSLasso on Corel5k, MIRFIickr25k and S&iP500 where f ~ g, 
but significantly faster than NSLasso on RCVl-v2 where p <C (?. 

Figure ED shows the precision of top k links in the conditional graphs as a function of k. It can be 
seen that pGGM performs favorably in comparison to the other three methods for identifying correct 
links on RCVl-v2. On S&P500, pGGM and GLasso-M have comparable performance, and both are 
better than GLasso and NSLasso. This is because the S&iP500 stocks are weakly correlated and 
thus the conditional graphical model can be well approximated by the marginal graphical model. 

We further evaluate the sparsity of the constructed graphs on these datasets. The links are 
identified by : i ^ j, \ ['^yy]ij\ > A*} in which > is a threshold value. Figure lOl shows the 

number of links in graphs as a function of ^. It can be seen that pGGM, GLasso and NSLasso tend 
to output sparser graphical models than GLasso-M. A potential reason is that GLasso-M ignores 
the information provided by X, and thus false positive links can be induced. NSLasso outputs the 
sparsest network on corelSk, MIRFIickr25k and S&P500, while pGGM outputs the sparsest model 
on RCVl-v2. Note that NSLasso does not estimate precision matrix. Moreover, pGGM tends to 
be slightly sparser than GLasso. These observations are consistent with our observations on the 
synthetic data. 

Figure [63] plots the graphs constructed by using different estimation methods with ^ = 0.1 
for CorelSk, MIRFIickr25k and RCVl-v2, and n = 0.05 for S<S^P500. It can be seen that different 
methods will construct different graphs. Figure 16.51 illustrates in detail the top 50 links in each 
graph. 
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Figure 6.2: Link precision curves on RCVl-v2 and S&;P500. 




Figure 6.3: Number of links as a function of in the constructed conditional graphical model. 
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7 Conclusion 



This paper presents a new formulation pGGM for estimating sparse partial precision matrix. The 
advantages of pGGM over prior GGMs and conditional GGMs include: (i) the formulation is 
convex; (ii) the optimization procedure scales well with respect to the component X; (iii) the 
model has natural interpretation in terms of the conditional dependency between the variables in 
X and Y; and (iv) theoretical guarantees on the global solution can be established without sparsity 
assumptions on the precision matrix of X. We showed that the rate of convergence of pGGM 
depends on how sparse the underlying true partial precision matrix is. Numerical experiments on 
several synthetic and real datasets demonstrated the competitive performance of pGGM compared 
to the existing approaches. 

In the current paper, the pGGM is derived under the assumption that {Y; X) is jointly normally 
distributed. As discussed in Section [5] that pGGM is still valid in the setting where the joint 
normality is relaxed to the conditional normality. We would like to point out that by assuming 
the Gaussian copul ar structure of th e random vector, pGGM can be easily extended to the setting 
of nonparanormal ( Liu et al. . 20091 ) which is a useful tool for semiparametric estimation of high 



dimensional undirected graphs. We believe that such an extension will broaden the application 
range of pGGM in practice. 
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A Technical Proofs 



A.l Proof of Proposition [T] 

Proof. Using the following well known fact of block matrix determinant 



det 



A 5' 
B C 



1 dT\ 



det{A) det{C - BA^^B 



and simple algebra, we obtain that 
where 



)+tv{j:^xi^xx-ny,n-y'nyx)), (a.i) 



yx) 



logdet(0™)+tr(S" f). 



+ 2tr(S-JO 



yx. 



+ ^"^^xx^yx^yy ^yx) 



We next show that Lpa,{Qyy , Qy^) 



yx^'^'yy ''''yx- 



sides of (jA.ip over ^Ixx, which is achieved at Q^x = i'^xx) ^ + ^yx^yi^^yx, we know that up to an 



The claim (|2.2p follows immediately from the re-parametrization of Qxx = ^xx 

is convex. Note that when S"^ >- 0, by minimizing both 

i^xx 

additive constant, Lpa is the pointwise minimum of L over il-xx- Since the pointwise minimization 
of a convex objective function with a part of the p arameters is convex with respect to the other 
parameters (see, e.g.. lBoved fc Vandenberghd . |2004| ) . we immediately obtain the convexity of Lpa. 
In the high-dimensional case where n < g, we only have ^ and thus the minimization over 
rixx is not well-defined. To show the convexity in general case, we may replace Tixx by Tixx + AI 
for some A > 0, and the resulting partial GMM formula: 

L^^inyy, Qyx) = " log det(0,j,) + tf ( S;)^ ) + 2tr(S^j0,,) + tr((S^, + A/)!)^ j^^^ij^^^) 



is convex in {ftyy,ftyx) by the previous argument. Now, let A — t- O"*", we have Lp^{Qyy , ^lyx) 
Lpai^yy , ^yx) , which immediately implies the convexity of Lpa(-,-)- 



□ 
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A. 2 Proof of Proposition [3] 

Lemma 1. Assume the conditions of the proposition hold. Then for any matrix V = (Vyy, Vyx) £ 
RPXP X such that \Vs\i < a\Vs\i, we have 



ii{VtV^) > ^\\v\\l, 

5 



where 



^xx 



Moreover, we have 



tr(VS^,Kl)<2.25p+||y||2,. 



Proof. In the following, we let s = \S\ and s' = s — s > 4(p+/p„)a^s. Since r < Amax(^^y)i we 



know that Xras.^{^yy) ^ > P-- Indeed, XimxA^yy) < Amax(^^^y) + Amax(Ai7yy) < Amax(^^yy) + r < 

2Aniax(^^yj;), which from the definition of p~ implies that XmsLxi^yy)^^ > Therefore for any 
U G MP^(p+'?) such that \U\q < s + s', the conditions of Assumption [1] imply that 



We order the elements of Vg in descending order of absolute values. Let V^^^ = Vs which contains 
s nonzero values, and V^''^ contains (at most) s' nonzero values of with (ks' — s' + l)-th to 
(A:s')-th largest absolute values. It follows that H^^^^+^^Hf < v1"^^(^+^W^^^^+^ < \V^'''^\i/Vs' 
for all /c > 1. Therefore we have 



ao 



and 



ai=|tr((yw + y«)s^y('=+i)^)| 

k>l 

k>l 
k>l 

Note that tT{VTiV^) > oq — 2ai + 02, where 



a2 = tr 



V 



v^^+^^ \^\Y. ^^^^^ 



, fc>i 




The semi-positive-definiteness of S implies that min^[ao + 2/iai + /U^a2] > 0, which implies that 
a\ < 0002- Therefore 

tr(y^Sy) >ao — 2ai + 02 > oq — 2ai + a^/oo 

>/,_||yW + - aV(p+/p-)(sA'))' > +F« 11^/4, 



where the last inequality is due to the definition of s' that implies that ay/ {p+/ p^){s/ s') < 0.5. 
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Moreover we have 



|=||y(o) + y(i)||| + ^||F('=+i)||| 

fc>i 

<||y(0) + y{i)||| + ^||yW||2/,/ 



k>l 

<||y(0)+y(l)|||+||y(l)||^||y_||^/,' 

<||y(0) + yW\\l + a||y«||2||F5||2vW 
<(1 + 0.5qv^)||F(°) + < 1.25||yW 

By combining the previous two displayed inequahties, we obtain the first desired bound. 
To prove the second bound, we define 



Opxp 





Therefore for any U £ MP^(p+'J) such that \U\o < s + s', the conditions of Assumption [T] imply that 



tr{UJ:'U^) < p+\\U\{] 



Therefore we have 
and 



=tr 



V 



fe>l / \A;>1 

fc>l fc'>l 

k>l k'>l 



fc>l k'>l 
2 iTA |2 / / 



<QV+|Vs|t/s' < aV^ 
Therefore we obtain (using a'^{s/s') < 0.25) 



tr(y,,SS,V;];) <a', + 2 + 4 < 1.5a'o + 3^ < (1.5 + 3/4)p^ 



2.25||y||^ 



This completes the proof. 
Lemma 2. Let 

then we have 



□ 



'd = min 



Amin(^j,y) 



_3 8Aniax(^^a;5^xx(^yx)^) 
Ama.(f^;,'^^,xS^.J^Jj < 1/(2^?). 
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Proof. Let a{A) be the largest singular value of a matrix A, then a{A) = \J Amax(^^^)- Therefore 
we have 



<v/Vax[J^,*:.SS,.(J^;JT] + 1.5^11 Afilli. 



<1.4^A^ax[J^;.S*.,(0;JT]^ 
where the third inequality uses the second inequality of Lemma [H and the last inequality uses the 



third inequality of Assumption [T] and ||Ail||ir < r < O.lSy Amax This implies 

Since the assumption of r < Amin(f^yy)/2 also implies that 
Therefore we have 



\ ('O-Iq V" W ^ ^- -max V" "yx/ J _ , (,^ 



which leads to the desired bound. □ 
Proof of Proposition\^ For any s G (0, 1), we define for convenience that 

^yy — ^yy ~^ s^^yyj — ^yx ~^ ^Af^^^, 

and consider the function f{s) defined as 

f{s) := -logdetinyy)+trii:^ynyy) + 2tTi^^Jnyx) + ti{Kx^'^x^yy^y-)- 
It can be verified that 

f'is) = - trin^y^Anyy) + tr(S^^A17,,) 

+ {T,yj A^lyx) + {T,^^^}y^Qyy Ailyx) " tl (Ti^ ^^^y yy yy^ yy y x) 

and 

ns) =tr{n-y'Anyynyy'Anyy) + 2tii^:,An]^,n-y'Anyx) - MKx^yx^yy^^yy^yy^^yx) 

+ '^^''^i^xx^yx^yy '^^yy^yy '^^yy^yy^yx) ■ 

We obtain from Taylor expansion that 

ALpa(G*, AG) = ^/"(s), for some s G (0, 1). 
24 



This implies that 

= tr(Qyy A^lyy^lyy A^lyy) + 2tr(l]^ri.Aily^^lyyAQyx) " 4:tl {T,^ ^^y yy AD.yyQ yy A^^, y x) 

+ + '&)tT{T,'^^Qy^i}yy Ai}yyi}yy AQyyQyyQyx) " 1 1 ( ^ ^J, ^ ^ A ^ ^ ^ ^ ^ A ^ ^ ^ ^ ^ ^ ^ ) 



2t? 



> tv{n-y'Anyyn-y'Anyy) + ^tr(s^,Af)J,Q-^iAQ, 



'!?tr(r2j^j^ ^ ^lyxTixx^yx^yy^ ^yy^ ^^yy^yy ^^yy^yy ^ ) 



"j/j/ •'"yx^xx'^^yx'^^yy ""yy ^""yy^yy ^""yy "yy 

> tr{nyy^Anyyn-y^Anyy) ' ^ o-i , 



+ 



2 + 1? 



triJ^^x^n^x^y^Anyx) 



maxV'^'yj/ '•^yx^xx^^yx^''yy 1^ \ yy VV yy VV yy ) 



> OMlin-^Anyy^-^Anyy) + 



j^iriJ^^x^n'^x^-y'Anyx), 



where we have used the trace equahty tv{AB) = tv{BA) throughout the derivations. The first 
inequahty is due to the trace inequahty {2/{2 + i&))tT:{A^ A) - 4tr{A^ B) + {2 + i})tr{B^ B) > 0; the 
second inequahty uses ti{AB) < Amax(^)ti'(-B) for symmetric positive semidefinite matrices A and 

B; and the last inequality is due to X^^.^i^yy^'^^yx^xx^yx^yy^'^)^ — 1/2 (Lemma[2]). 
Since ?? < 2/3, we have 0.5 > 2??/(2 + /). Therefore 



2ALpa(e*,AG) =/"( 



> 



> 



> 



2?? 



1-1 



Aa, 



2 _|_ ^ ^^(^yy ^^yy^yy ^^yy) + ^^C^xx^^yx^yy '-^^^yx 

2'd r 
J^^^^A^yy) triAQyyn-y^Anyy) + triAQyxKx^^'^ 

'^^KU^yy)P~- 



yxj 



5(2 + 1?) 



lAGI 



where the second inequality uses iT{AB) > Amin(^)tr(-B) for symmetric positive semidefinite ma- 
trices A and B; and the last inequality follows from Lemma [H We complete the proof by noticing 
5(2 + t?) < 40/3. □ 



A. 3 Proof of Proposition [4] 

We will employ the following tail-bound for random variable, due to lLaurent fc Massart (I2OOOI ). 



Lemma 3. Consider independent Gaussian random variables zi, . . . ,Zn ~ AA(0, cr^). We have for 
all t > 0; 



Pr 



^ z| > nu^ + 2a'^Vni + 2cr^t 



< e" 



and 



Pr 



< e~ 
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The following lemma is a consequence of Lemma[3]wlien applied to the covariance of multivariate 
Gaussian distribution. 

Lemma 4. Consider the covariance matrix S* of a d- dimensional Gaussian random vector and its 
sample covariance S" from n i.i.d. Gaussian random vectors from AA(0, S*). For any rj € (0,1) 
and any deterministic d' x d matrix A. Let 



a = max 

ij 



{A^*A')u + 2\{A^*)i,\ + {^* 



in 



then with probability at least 1 — r] for any r] G (0, 1), we have 

- S*)|oo < 2aVln(W/r/)/n, 

provided that n > In^Add' / rj) . 

Proof. Consider the multivariate Gaussian random vector X^^\ . . . , X^") ~ AA(0, S*). 

Given any index pair let = (AXW)i + xf. We have ~ AA(0, (ASM"r)ii + 

2{AT,*)ij + We thus obtain from Lemma [3] that for t < n: with probability at least 

1 - 2e-*, 



n 



'jji 



< 4a^y/t/n. 



Similarly, we have for t < n: with probability at least 1 — 2e 



n 



1 ^{AX(% - Xff - [{AT.*A~')u - 2{AT.*),j + (S*),,- 



< 4.a'^^/t/n. 



Taking union bound, and adding the previous two inequalities, we obtain that with probability at 
least 1 — 4e~*: 



n 



1 Y,{AX^% + Xff - [(ASMT),, + 2(AS*),, + (S*),-,] 

£=1 
n 

Y,iAX^% - Xf? - U^*A^h - 2{A^*)ij + i^lj 



£=1 



< 8a^y/t/n. 



This simplifies to - < 2a'^ .Jtfn. Now by taking union bound over i = 1, . . . , d' and 

j = 1, . . . ,d, and set t] = Add'e~*, we obtain the desired bound. □ 

Note that in Lemma [H we have cr^ < 2maxi{AT,* A^)ii + 2maxi(S*)ii. It implies that with 
probability 1 — rj: 



\A{T.'' - S*)|oo < 4[max(ASM^)ii + max(S*)ii] 0n( W /??) /n 

i i 

when n > ln(Add' / r]) . 



(A.2) 
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Proof of Proposition^ For any r] G (0, 1) such that n > ln(10(p + q)'^/r]), we obtain from ()A.2p 
with A = I that with probabiUty 1 — OAr]: 

IS'^ - S*U < 8max(S*)iiv'ln(10(p + qYh)/n. 

i 

Let ^ = (r2*j^)~^r2*^. We may also apply (IA.2|) to the Gaussian covariance matrix ATil.^A^ and 
A = / to obtain that with probability 1 — 0.4ry: 



n. 



Similarly, we may also apply (|A.2p to the Gaussian covariance matrix S* with ^ = ^ to obtain 
that with probability 1 — 0.2ry: 

- iS^Joo < 8max(is:,i^)iiyh^(20^^A^. 
Taking union bound with the previous three inequalities, we have with probability 1 — ry: 



and 

0.55„ < |S" - S*|oo + - i5]*Joo < 8K,Vln(10(p + g)2/r/)/n, 



where 



This completes the proof. □ 
A. 4 Proof of Theorem [1] 

For convenience, we will introduce the following notations: 

AO — O _ O* AO — n — O* 

yy ' — yy yy^ ^^^yx • — ^^yx ^^yx^ 

and AG = G - G* = {AQyy, AQy^). 

We first introduce the following lemma which shows that error is in the cone of Definition [TJ 

Lemma 5. Assume that min{A„,Pn} ^ 27„. Then the error AG satisfies [AG^Ii < ajAGsli. 

Proof. Since {^ly)syy — 0; have 

miy + Ao,,)-|i - mi^rii = + Any^y^j^ + miy + Anyy^j^ - miyru 

= \{n*yy + Anyy),j, + \{Anyy)s-j, - 1 (o;^ ) " | ^ 

> \{Anyy)-=^Jl - \ {Anyy)gJl 

> \{^^yy)syy\i-\myy)Syy\i. (A.3) 

Similarly we have 

+ Ao,,|i - \n;j, > \myx)syji - \myx)syji. (a.4) 
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We define the function f{s) as in the proof of Proposition [3l From the convexity of the loss Lpa 
we have 

ipa(0) - = /(I) - /(O) > /'(O) = tr{AlAnyy) + tr{BjAny,), 

where 

■^n = '^yy ~ i^yy) ~ i^yy) ^yx^xxi^yx) i^yy) j = '^{'^yx ~^ (^yy^ ^yx'^xx)- 

From the equahties in (|5.3p we can equivalently write 

^1'^ ~ '^yy~'^yy~ i^yy) ^yxi'^xx~'^xx)^yx i^yy) t -^^n = '^{'^yx~'^yx~^ i^yy) ^yxi'^xx ~ '^xx)) ■ 

Note that we have 

\tl{A^Anyy)\ < \An\oo\Anyy\l < y|AQyy|l, 

and 

where we have used the assumption min{A„,p„} > 27„. Therefore 

- > -Y^^^yy^i - ylA^^^^xli- (a.5) 

By combing ()A.3p . ()A.4p . and (jA.Sp . we obtain 

> v(e) + Re{e) - Lpa(e*) - Re{Q*) 

> ^{AQyyll - ^\Any^\l + \ni\iAnyy)s^Jl " | ( j^y ) J 1 ) + /9„ ( | ( AJ^y^^ ) fi^ J 1 " | ( Afij^^ ) J 1 ^ 

> ^ (|(AO,,)^^Ji - 3|(AJ],,)5„Ji) + ^ (|(A0,,.)5,Ji - 3|(AJ7,.)5,jr 

> (l(AJ^,,)5,Ji + l(Af)..)5.Ji) - (l(Al^..)5.Ji + l(A0..)5.Ji) , 
which implies |(Ae)5|i < a|(Ae)5|i. □ 

Proof of TheoremUl Since Xn,Pn G [27rnCo7n], by Lemma[5]we have |(A6)g|i < a|(A6)5|i. Let 
AG) = {AQyy,A^yx) = t AQ where we pick t = 1 if ||A6||f < "To and t G (0, 1) with ||A6||f = "Tq 
otherwise. By definition, we have ||A0||i? < ro and |(A0)^|i < aKAB)^!!. Due to the optimality 
of and the convexity of Lpa, it holds that 

Lpa(G* + tAe) + Re{e* + tAQ) < Lpa(e*) + Re{e*). 

Following the similar arguments in Lemma [5] and the LRSC of Lpa we obtain 

> Lpa(G* + tAe) + Reie* + tAQ) - Lpa(e*) - i?e(e*) 

> ^ (\myy)s Ji - m^yy)sji) + ^ f l(A^^,:.)5,Ji " 3|(A0,,)5,. 



2 yy/Oyy\^ -i\ yy/^yy\^j ' 2 

+/3(G*;ro,a)||Ae||?, 

> -L5max{A„,p„}|(AG)5|i +/3o||Ae||| 

> -L5co7nV^||AG||,r + /3o||AG|||, 
which implies that 

II AQIIi. < L5co/3o'S„^ = A„. 

Since A„ < tq, we claim that t = \ and thus AG = AG. Indeed, if otherwise t < 1, then 
1 1 AG 1 1 i? = ro > A^ which contradicts the above inequality. This completes the proof. □ 
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Table B.l: Comparison of average CPU run times and average matrix losses and F-scores for 
synthetic datasets over 50 replications. In this experiment, we fix n = 100 and p = 50. 

Methods g = 50 q = 100 q = 200 q = 500 



CPU Time (sec.) i 



pLiVjiVi 


n 1 7 

U.i ( 


n 9R 


n AP. 

U.4D 




n Qs 
u.yo 




99 


98 


45 




1 09 


(crL/aSSO 


U.40 


i.Oi 


o.uz 




1 c^n OQ 


IN oi-iasso 




9 "if! 


0. 14 




O.oo 






Operator norm 


■ ll^* — ||2 


1 




pVjVjiVi 


n Qfi (C\ C\A\ 

u.yo (^u.U4j 


1 riR ("n 

i.UD l^U.Uoj 


1 1 7 An riQ^ 




1 9Q An n9'\ 


CLxljiVi 


n QQ (C\ C\A\ 


1 n7 ^"0 (\A\ 
L.xJI [y.yj^j 


1 1 s An n'?^ 
i.io ^^u.uoJ 




1 9*? An n9'\ 


GLasso 




i.44 (^U.U ( j 


1 71 An n7\ 
i. Ai (U.UA j 




9 Q 1 An n/) ^ 
z.oi (^U.U4J 


v\ oijasso 
















Matrix ^i-norm 


L lie - e*||i 


1 




pLrLriVl 


Z.Oi (U.izj 


1.98 (0.23) 


1.81 (0.11) 




1 in An 1 n\ 
1.10 (O.lOj 


cLrLiM 


z.oo (O.ibj 


2.13 (0.20) 


1.89 (0.06) 




1 in An 1 n\ 
1.10 (O.lOj 


GLasso 


9 on ^"0 90^ 

z.yu (^u.zuj 


3.03 (0.32) 


3.11 (0.21) 




Q on An '^9'> 


NSLasso 
















Frobenius norm 


- ||e-e*||i7 


•i 




pGGM 


3.36 (0.07) 


3.91 (0.11) 


4.81 (0.12) 




4.58 (0.04) 


cGGM 


3.43 (0.07) 


3.96 (0.12) 


4.85 (0.13) 




4.59 (0.04) 


GLasso 


4.58 (0.11) 


5.94 (0.06) 


7.89 (0.08) 




12.22 (0.03) 


NSLasso 
















Support Recovery F-score 


t 




pGGM 


0.41 (0.01) 


0.37 (0.01) 


0.35 (0.01) 




0.23 (0.01) 


cGGM 


0.33 (0.01) 


0.31 (0.01) 


0.32 (0.01) 




0.23 (0.01) 


GLasso 


0.31 (0.01) 


0.27 (0.01) 


0.27 (0.01) 




0.22 (0.01) 


NSLasso 


0.40 (0.01) 


0.35 (0.01) 


0.32 (0.01) 




0.21 (0.01) 



B Additional Materials on Monte Carlo Simulations 

In this appendix section, we provide the detailed performance figures on the synthetic data as 
described in Section 16.11 For support recovery, we use F-score. We also measure the precision 
matrix estimation quality by three matrix norms: the operator norm, the matrix £i-norm, and the 
Frobenius norm. The results are presented in Table IB. II and Table IB. 21 . 
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Table B.2: Comparison of average CPU run times and average matrix losses and F-scores for 
synthetic datasets over 50 replications. Here we fix n = 100 and p = 50. 



Aid hods 




Q -- 


= 50 


q = 100 q = 500 


Q = 


1000 










CPU Time i 






pGGM 




0.17 


0.26 0.46 


0.98 


GLasso-M 




0.04 


0.05 0.05 


0.05 










Operator norm \\^yy — ^^yy 2 i 






pGGM 


0, 


.76 


(0.04) 


0.86 (0.07) 0.91 (0.06) 


0.58 


(0.01) 


GLasso-M 


0. 


.88 


(0.06) 


0.86 (0.09) 0.88 (0.03) 
Matrix ^i-norm | l^^^j/ — f^^^ 1 i 


0.86 


(0.02) 


pGGM 


1. 


.94 


(0.12) 


1.94 (0.26) 1.879 (0.13) 


0.94 


(0.03) 


GLasso-M 


2. 


.80 


(0.18) 


2.87 (0.29) 2.76 (0.08) 

Frobcnius norm \\^yy — VtyyWp i 


1.93 


(0.08) 


pGGM 


2, 


.55 


(0.08) 


2.68 (0.12) 3.17 (0.15) 


2.18 


(0.06) 


GLasso-M 


3, 


.14 


(0.09) 


3.11 (0.09) 3.26 (0.05) 
Support Recovery F-score t 


3.03 


(0.04) 


pGGM 


0, 


,42 


(0.01) 


0.38 (0.02) 0.39 (0.02) 


0.30 


(0.01) 


GLasso-M 


0, 


.31 


(0.01) 


0.28 (0.01) 0.27 (0.01) 


0.27 


(0.01) 
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(a) CorelBk, ^l = 0.1. Method(# Links): pGGM (677), NSLasso (293), GLasso (909), GLasso-M (1153). 






(b) MIRFIicker25k, n = 0.1. Method(# Links): pGGM (409), NSLasso (110), GLasso (573), GLasso-M 
(960). 





(c) RCV1-V2. // = 0.1. Mcthod(# Links): pGGM (87). NSLasso (156), GLasso (282), GLasso-M (688). 






(d) S&P500, 11 = 0.05. Method(# Links): pGGM (136), NSLasso (94), GLasso (160), GLasso-M (221). 
Figure 6.4: Constructed graphs by pGGM, NSLasso, GLasso and GLasso-M. 
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pGGM 



NSLasso 



GLaeso 



GLasso-M 




(b) MIRFIicker25k. 



pGGM NSLasso GLasso GLasso-M 




(d) S&P500. 

Figure 6.5: The top 50 links in the constructed graphs by pGGM, NSLasso, GLasso and GLasso-M. 
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